File-type Identification with Incomplete Information
Authors
Abstract
File-type Identification (FTI) is an important problem in digital forensics, intrusion detection, and other related fields. Using state-of-the-art classification techniques to solve FTI problems has begun to receive research attention; however, general conclusions have not been reached because thorough evaluations for method comparison are lacking. This paper presents a systematic investigation of the problem, algorithmic solutions, and an evaluation methodology. Our focus is on performance comparison of statistical classifiers (e.g., SVM and kNN) and knowledge-based approaches, especially COTS (Commercial Off-The-Shelf) solutions, which currently dominate FTI applications. We analyze the robustness of different methods in handling damaged files and file segments. We propose two alternative criteria for measuring performance: 1) treating file-name extensions as the true labels, and 2) treating as the true labels the predictions that knowledge-based approaches make on intact files; because these approaches rely on signature bytes, the signature bytes are removed before each method is tested. In our experiments with simulated damage to files, SVM and kNN substantially outperform all the COTS solutions we tested; some COTS methods cannot identify damaged files at all. Our experiments also show that SVM and kNN scale to large applications after adequate feature selection.
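As a rough illustration of the setting the abstract describes, the following minimal sketch classifies files whose signature bytes have been removed, using a kNN vote over byte-frequency histograms. The abstract does not state which features or similarity measure the authors used, so the histogram features, the cosine similarity, the fixed 8-byte header, the synthetic "text" and "binary" samples, and all function names here are assumptions made purely for illustration.

```python
import math
import random
from collections import Counter


def byte_histogram(data: bytes) -> list[float]:
    """Relative frequency of each of the 256 possible byte values."""
    counts = Counter(data)
    total = len(data) or 1
    return [counts.get(b, 0) / total for b in range(256)]


def simulate_damage(data: bytes, header_len: int = 8) -> bytes:
    """Drop the leading bytes, where signature bytes are assumed to sit."""
    return data[header_len:]


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def knn_predict(train, query, k: int = 3) -> str:
    """Majority label among the k training vectors most similar to query."""
    ranked = sorted(train, key=lambda item: cosine(item[0], query), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def make_sample(kind: str, rng: random.Random, size: int = 512) -> bytes:
    """Synthetic file: a made-up 8-byte signature followed by a body."""
    if kind == "text":
        body = bytes(rng.choice(b"abcdefghij \n") for _ in range(size))
        return b"TXTSIG!!" + body
    body = bytes(rng.randrange(256) for _ in range(size))
    return b"BINSIG!!" + body


rng = random.Random(0)

# Train on damaged files so the classifier never sees the signature bytes.
train = [
    (byte_histogram(simulate_damage(make_sample(kind, rng))), kind)
    for kind in ("text", "binary")
    for _ in range(5)
]

for kind in ("text", "binary"):
    damaged = simulate_damage(make_sample(kind, rng))
    print(kind, "->", knn_predict(train, byte_histogram(damaged)))
```

A signature-based identifier would have nothing to match on these truncated inputs, whereas a content-based classifier can still use the byte distribution of the remaining data; this is the kind of robustness comparison the abstract refers to, shown here only on toy data.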
Similar resources
The effect of missing sire information on genetic gain and genetic trend of a quantitative trait using computer simulation
In order to study the effect of incomplete sire pedigree on genetic trend (bBv,y) and gain (R) of a quantitative trait, two populations were simulated with heritabilities of 0.15 and 0.30. For each population, information resulting from ten years of selection was saved in different files. In generated data files, the sire numbers were eliminated from pedigree file with 0, 10, 20, …, 100 percentag...
SÁDI - Statistical Analysis for Data Type Identification
A key task in digital forensic analysis is the location of relevant information within the computer system. Identification of the relevancy of data is often dependent upon the identification of the type of data being examined. Typical file type identification is based upon file extension or magic keys. These typical techniques fail in many typical forensic analysis scenarios such as needing to ...
Extensions of the UNIX File Command and Magic File for File Type Identification
File format identification is a core requirement for digital archives. The UNIX file command is among the most promising technologies for file type identification. This report describes extensions to the file command and magic file that enhance their utility for file format identification in archival systems. A File Format Library (database) has been created to manage information about file for...
Incidence of incomplete excision in surgically treated basal cell carcinomas and identification of the related risk factors
Background: Surgery is the most frequent treatment modality for basal cell carcinoma but in spite of its high cure rate, the frequency of incomplete excision varies widely (0.7-50%) among dermatologic centers. Our case series was designed to determine the frequency of incompletely excised basal cell carcinoma and the related risk factors. Methods: A total of 1424 basal cell carcinoma (1040 pati...
Sliding Window Measurement for File Type Identification
Knowing the file type associated with a sequence of bytes makes interpretation of those bytes far more meaningful. With the ever increasing number of file types in existence and the massive storage capacity of modern hardware, it is impractical to try interpreting a sequence of bytes as every known file type until one succeeds. Furthermore, some file types require specific header or footer info...